This task works with an example dataset of old-fashioned Dutch boy names. I chose them because they are short and, in my humble opinion, because they sound quite cool. If you are working with your own data, you need a dataframe with at least two columns, a date column and a text column, as in the example below:
# Import some packages and create an example dataset
import warnings
warnings.filterwarnings('ignore') # only use this when you know the script and want to suppress unnecessary warnings
# Create a dataframe
import pandas as pd
data = {'year': ['1950', '1951', '1952', '1953', '1954'],
        'text': ['Cees Aart Arie Jan Otto Gijs Sef Toon',
                 'Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aart Arie Jan Otto Gijs Sef Toon',
                 'Aart Arie Toon',
                 'Jan Otto',
                 'Gijs']}
df = pd.DataFrame(data, index=['0', '1', '3', '4', '5'])
# In Jupyter Notebooks, you can put the name of a dataframe (e.g. "df") on the last line of a cell to display it
df
| | year | text |
|---|---|---|
| 0 | 1950 | Cees Aart Arie Jan Otto Gijs Sef Toon |
| 1 | 1951 | Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aar... |
| 3 | 1952 | Aart Arie Toon |
| 4 | 1953 | Jan Otto |
| 5 | 1954 | Gijs |
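If your own data lives in a CSV file, loading it could look roughly like the sketch below. The column names `year` and `text` and the inline sample are assumptions for illustration only; replace `io.StringIO(csv_text)` with the path to your own file.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice, pass a file path to pd.read_csv instead
csv_text = """year,text
1950,Cees Aart Arie
1951,Jan Otto Gijs
"""

# Read the year as a string so it can be converted to a datetime later
own_df = pd.read_csv(io.StringIO(csv_text), dtype={"year": str})

# Check that the two columns this task relies on are present
missing = [col for col in ("year", "text") if col not in own_df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

print(own_df)
```

Reading `year` as a string is a design choice here: it keeps the column in the same shape as the example dataset, so the rest of the steps work unchanged.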
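# Convert the date string in the column "year" (e.g. '1950') to a Pandas datetime object
# format='%Y' states explicitly that the string is a four-digit year; errors='coerce' turns unparseable values into NaT
df['datetime'] = pd.to_datetime(df['year'], format='%Y', errors='coerce')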
df
| | year | text | datetime |
|---|---|---|---|
| 0 | 1950 | Cees Aart Arie Jan Otto Gijs Sef Toon | 1950-01-01 |
| 1 | 1951 | Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aar... | 1951-01-01 |
| 3 | 1952 | Aart Arie Toon | 1952-01-01 |
| 4 | 1953 | Jan Otto | 1953-01-01 |
| 5 | 1954 | Gijs | 1954-01-01 |
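Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth checking how many rows failed to convert. Here is a minimal sketch, assuming a made-up series with one bad value (`'n.d.'`); the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical year strings, one of which is not a valid year
years = pd.Series(["1950", "1951", "n.d."])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(years, format="%Y", errors="coerce")

# Count and show the values that could not be converted
n_failed = int(parsed.isna().sum())
print(f"{n_failed} value(s) could not be parsed:")
print(years[parsed.isna()])
```

Rows with `NaT` are left out when you resample later on, so this check tells you how much data you would lose.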
# Word counts for 'term of interest' per year
# You can resample by year, month, or day, i.e. 'A-DEC', 'M', or 'D'
# (pandas 2.2 and later deprecate 'A-DEC' and 'M' in favour of 'YE-DEC' and 'ME')
# For example, if you have daily time series data, you can aggregate observations by day, month, or year.
# In this example, we work with yearly time series data, so we can aggregate observations by year.
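# Note: str.count() treats the pattern as a regular expression, so 'Aart*' means "Aar" followed by zero or more "t" (it is not a wildcard)
df['term_of_interest'] = df['text'].str.count('Aart*')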
df_word = df.set_index('datetime').resample('A-DEC')['term_of_interest'].sum()
df_word = df_word.reset_index()
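# numeric_only=True skips the datetime column, which recent pandas versions refuse to sum
print(df_word.sum(numeric_only=True))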
df_word
term_of_interest    4
dtype: int64
| | datetime | term_of_interest |
|---|---|---|
| 0 | 1950-12-31 | 1 |
| 1 | 1951-12-31 | 2 |
| 2 | 1952-12-31 | 1 |
| 3 | 1953-12-31 | 0 |
| 4 | 1954-12-31 | 0 |
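Since `str.count()` interprets the search term as a regular expression, the pattern you choose decides what gets counted. The sketch below compares three patterns on two made-up strings. The names "Aartje" and "Aaron" are my own additions for illustration and are not part of the example dataset.

```python
import pandas as pd

# Hypothetical texts that contain look-alike names
texts = pd.Series(["Cees Aart Arie", "Aart Aartje Aaron"])

# "Aar" followed by zero or more "t": also matches the start of "Aaron"
loose = texts.str.count("Aart*")

# Whole word only: \b marks a word boundary on both sides
exact = texts.str.count(r"\bAart\b")

# Any word that starts with "Aart", such as "Aartje"
prefix = texts.str.count(r"\bAart\w*")

comparison = pd.DataFrame({"loose": loose, "exact": exact, "prefix": prefix})
print(comparison)
# loose counts 1 and 3, exact counts 1 and 1, prefix counts 1 and 2
```

For the example dataset all three patterns give the same counts, but on real text the differences can add up quickly, so it is worth deciding which behaviour you want.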
# Optional: use a Plotly bar chart to plot the data
# To run this code, you need to install the Plotly package first (see the How to get started with Python page in my Blog section)
import plotly.express as px
fig = px.bar(df_word, x='datetime', y='term_of_interest')
fig.update_layout(showlegend=False,
                  xaxis_rangeslider_visible=False,
                  width=450,
                  height=450)
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)')
fig.update_xaxes(title_text="Year", showgrid=True, gridwidth=0.3, gridcolor='LightGrey')
fig.update_yaxes(title_text="# References to term of interest", showgrid=True, gridwidth=0.3, gridcolor='LightGrey')
fig.show()